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Abstract 

Undirected graphical models are widely used in statistics, physics and machine vision. However 
Bayesian parameter estimation for undirected models is extremely challenging, since evaluation of the 
posterior typically involves the calculation of an intractable normalising constant. This problem has 
received much attention, but very little of this has focussed on the important practical case where the data 
consists of noisy or incomplete observations of the underlying hidden structure. This paper specifically 
addresses this problem, comparing two alternative methodologies. In the first of these approaches particle 



Markov chain Monte Carlo (Andrieu et al. 20101 is used to efficiently explore the parameter space, 



combined with the exchange algorithm ( Murray et al. 2006 1 for avoiding the calculation of the intractable 



normalising constant (a proof showing that this combination targets the correct distribution in found in 
a supplementary appendix online). This approach is compared with approximate Bayesian computation 
( [Pritchard et aZ."] [1999). Applications to estimating the parameters of Ising models and exponential 
random graphs from noisy data are presented. Each algorithm used in the paper targets an approximation 
to the true posterior due to the use of MCMC to simulate from the latent graphical model, in lieu of 
being able to do this exactly in general. The supplementary appendix also describes the nature of the 
resulting approximation. 

Keywords: approximate Bayesian computation, particle Markov chain Monte Carlo, intractable norm- 
alising constants, exponential random graphs, graphical models. 

1 INTRODUCTION 



1.1 Motivation 

The subject of this paper is Bayesian inference in models for large numbers of dependant objects, e.g. pixels 
in an image, people in a social network, or pages on the world wide web. We focus on Markov random fields 
(MRFs), also known as undirected graphical models (UGMs), which are the models most commonly used 
for this type of data. Much of the literature (e.g. Murray et al. (20061; M0ller et al. (20061) is devoted to 
estimating the parameters of these models in the case where the MRF is completely observed, but for real 
data this is often not the case: in practice observations of the MRF can be noisy or incomplete. As a result, 
estimation of the parameters of the model presents a significant computational challenge, particularly when 
using Bayesian estimation in order to account for the missing data in a principled manner. 

Let y £ y represent noisy or incomplete observations of the hidden variables x Cz X. Our aim is to 
estimate the parameters 6 G 8 of a model for this data. Using the Bayesian approach, we describe a 
joint distribution p{9,x,y) — p{9)f{x,y\9) over 6, x and y and wish to estimate 9 through the posterior 
distribution: 

p{9\y)^ f p[9)f{x,y\9)dx. (1) 



Our focus is on Monte Carlo methods for simulating from p(9\y), and we present two quite general 
methodologies for achieving this. We apply these methods to two well known types of data: a noisy image 
(similar to Besag (1986l); and an indirectly observed social network (similar to Caimo and Friel ( 2011[ )). 

Parameter estimation in problems such as these is often addressed using maximum likelihood (as in 
Heckerman ( 1996[ ) ) , rather than a fully Bayesian approach (aside from the recent work in M0ller et al.\ ( 2004 1 ; 
Murray et al. ( 2006[ ); [Friel et al. (20091; Koskinen et al. ( 2010[ )). In this paper we compare the computational 
approach taken in these recent papers, which we show is not always appropriate, to alternative computational 
methods. 



In some cases we are also interested in the posterior distribution over the hidden variables x (in the 
applications described above, this is the case if we wish to infer a denoised Ising model or social network) 
but this is not our primary focus. A part of the methodology we describe is the use of sequential Monte 
Carlo (SMC) samplers for simulating from the posterior of x, which can be significantly more efficient than 
using MCMC on these models. 

In the remainder of this section we describe MRFs in more detail in section |1.2[ and basic methods of 
inference in these model in section |1.3[ Section |1.3| also describes the challenges faced in using these basic 
inference methods, and this motivates the rest of the paper. 

1.2 Models 

1.2.1 Markov Random Fields 

Throughout the paper, we take p{6) to have some simple form that can be both evaluated pointwise and 
simulated from using standard techniques. The primary purpose of this paper is to discuss computational 
methods for inference rather than choosing the most appropriate models, therefore we choose a proper prior 
simply to ensure that the posterior always exists. 

We assume that the likelihood f{x,y\9) factorises as an MRF, i.e. the joint distribution is completely 



defined by potential functions over groups of variables known as cliques (Besag 19741: 



1 "' 

f{^,y\o) = ^7r^\{^3{^^y^c,\9) (2) 



where <I>i:Af are potentials on cliques Ci:j\/, and the normalising constant 

„ M 
Z[0) - / n *j(^'J/ e Cj\e)dxdy (3) 

is usually an intractable integral. The methods described in this paper also apply to directed graphical 
models, which is easier to deal with since the factors of the joint distribution in a directed model are simply 



conditional distributions that can be normalised independently ( Heckerman 1996 ) , and hence the intractable 
normalising constant is not present. 

1.2.2 Ising Models 

The Ising model is a particular MRF that defines a distribution over a random vector x of binary variables 
each of which can take the value —1 or 1. The distribution is defined by the pairwise interactions between 
variables as f{x\Ox) = exp Ox^a j)c.-!<i^i^j IZ{9x), where N is a set that defines pairs of nodes that are 
"neighbours" and, for our purposes. Ox > 0. This distribution follows through choosing cliques containing 
only two nodes, with the potential function ^(xi,Xj\6x) — exTp{9xXiXj) for each clique. In this paper we 
concentrate on the case where N is chosen so that the variables are arranged in a two-dimensional grid. 
The study of these models often centres around the phase change that the model exhibits when 9x passes 
a critical value: a small 6x gives a small preference for neighbouring nodes to be the same and results in 
a high probability that localised groups of variables take on the same value. When 9x increases above the 
critical value the joint distribution has the vast majority of its probability mass on the cases where almost 
every variable takes on the same value. 



The Ising model and its generalisation, the Potts model (where the variables can take on more than two 
states), are frequently used in analysing spatially structured data, especially images. In these applications 
it is usually the case that the random vector x is observed indirectly through observations y. To model the 
noisy image data used later in the paper, we assume that each x variable has a corresponding y variable 
representing a noisy observation of the x variable. We then take the joint over the x and y variables to be: 



f{x, y\6) = f{x\e^)g{y\x, Oy) = — — exp 



(iJ)6N 



1 

Wy 



exp 



\T ■ 



Oy ^ Xkyk 



k=l 



(4) 



where T is the number of pixels in the image, Z{9y) — (exp(0j^) +exp(— 0j,)) is the normalising constant 
for all of the potentials on the (cc/c, yt) pairs. 



1.2.3 Exponential Random Graphs 

This section considers the most popular model for social networks: the exponential random graph model 



(ERGM) or p* model (Wasserman and Pattison 19961. These models are also used in physics and biology 



and aim to model data consisting of nodes and edges, which in the social network context represent actors 
and relationships between these actors. There has been relatively little work on using Bayesian inference 



for inferring the parameters of ERGMs aside from recent papers by Koskinen et al. (2010 1; Caimo and Friel 



(2011). 



For an ERGM, the random vector x is defined over the space of all graphs on a set of nodes, with each 
variable in x representing the presence or absence of a particular edge (x^ takes value 1 when edge k is 
present and when edge k is absent) . An ERGM for a graph (which consists of a set of edges over the set 
of nodes) is then given by: 



fix\9,) 



1 



eMOlSix)) 



(5) 



where S{x) is a vector of statistics of the graph (e.g. the number of edges, triangles, etc) and 9^ denotes the 
transpose of the parameter vector 9x. The normalising constant is intractable due to the extremely large 



number of possible graphs. ERGMs are simply MRFs defined on the edge space of networks (Frank and 



Strauss 19861, with statistics of the network corresponding to particular cliques in the MRF. 



We consider the case where the y variables represent noisy observations about the existence of each edge 
in the network. Analogous to equation (HI) we use the model 



f{x,y\9) ^ f{x\9^)g{y\x,9y) ^ 



Z{9^) 



eMels{x)) 



Z{9y) 



exp 



,^(2a;fe-l)(2yfe-l) 



fe=i 



(6) 



We note that the algorithms described in this paper are also applicable to the alternative missing data 



scenarios described in Koskinen et al. (2010). In fact, the only specific detail that is required to use the 



methods in this paper is the factorisation of x as an MRF: the space in which 9 and y lie affects only minor 
implementation details. 



1.3 Inference 



1.3.1 Posterior Inference 

Recall that the posterior distribution of interest is p{d\y). We now consider standard Monte Carlo meth- 
ods for sampling from this distribution, under MRF models such as those described in the previous section. 
Throughout the algorithmic sections of this paper we consider the general case where both the observed vari- 
ables y and unobserved variables x are jointly distributed according to an MRF, with distribution f{x, y\9). 
The high dimensionality of x leads to the use of Markov chain Monte Carlo (MCMC) for performing the 
integration in equation (fTl). Specifically, MCMC is used to simulate from p{9,x\y), projecting the obtained 
points into the marginal space in order to obtain a sample from p{0\y). In these circumstances, since it is 
unclear as to how to propose an efficient move in the joint space of 9 and x, the standard approach is to use 
an MCMC algorithm that updates 9 and x separately through moves on their full conditional distributions 



(see Murray and Ghahramani (20041; Friel et al. (20091; Koskinen et al. (2010), for example). We shall refer 



to this method as the "data augmentation" (DA) approach. 

Superficially the DA approach may seem to be extremely simple, but for the models described here there 
are three major difficulties with using this method: 

1. Sampling from p{9\x,y) is difficult, since this requires the evaluation of the intractable normalising 
constant in equation ([3|. 

2. Sampling from p{x\9,y) can be difficult, since this is usually a distribution with a complicated 
structure on a high-dimensional space. 

3. Using a data augmentation approach may be inefficient, since x and 9 are often highly depen- 
dent a posteriori. 

These difficulties largely dictate the structure of the remainder of the paper. In section |2] we devise a novel 
MCMC algorithm for sampling from p{9\y) that is designed to minimise the effect of each of these problems. 



combining the exchange algorithm of Murray et al. ( 2006 1 with marginal particle Markov chain Monte Carlo 



(PMCMC) (Andrieu et al. 20101. In section IS] explore an alternative approach to sampling fromp(0|y), using 



approximate Bayesian computation (ABC) ( jPritchard et al. 1999 1 to largely circumvent the difficulties listed 
above. However, this approach introduces complications of its own. Section |4] applies both approaches to 
data, with a comparison to the standard DA approach. 

Throughout the paper we make use of a single site Gibbs sampler on p(x\9,y), and for this reason we 
devote some space in the remainder of this section to describing this algorithm. 



1.3.2 Gibbs Samplers for MRFs 

The simplest approach to sampling from p{x\9, y) for a MRF model using MCMC is to use single component 
Metropolis, where the variables are updated one at a time. Variable k is updated using a Metropolis-Hastings 
step targeting the full conditional p{xk\x^k,9,y) oc Jlc 32; '^ji^^y ^ Q/I^)i where the product is over the 
potentials of all cliques that contain variable k. When there are strong dependencies between the variables 
this results in an inefficient MCMC algorithm. A partial solution might be to update highly correlated 



variables in blocks: Andrieu et al. (2010) note that to create an efficient algorithm the size of the blocks of 



variables must be limited, since it becomes more increasingly difficult to design good proposals for a block as 



the size of the block grows. The inefficiency of Gibbs samplers for Ising models is well known (e.g. Higdon 



(1998)) and a similar observation is made in the literature on ERGMs (Snijders 2002 1. Similar approaches, 
with the same drawbacks, are commonly used for sampling from p{x\9), which is encountered both in the 
exchange and ABC algorithms later in the paper. The inefficiency of these approaches motivates the use of 
SMC samplers, where we observe a significant improvement over the Gibbs sampler. 



2 EXCHANGE MARGINAL PMCMC 

This section describes the design of an MCMC algorithm for simulating from p{9\y) that addresses the 



problems listed in section 1.3.1 We begin in section 2.1 by describing methods for sampling from p{9\x,y) 
that avoid the evaluation of the intractable term Z{6) in equation dsl. Our favoured approach is the exchange 



algorithm of Murray et al. (20061: an algorithm that requires exact simulation from f{x,y\9) which is not 



generally possible in the models we consider. We study the effect of replacing exact simulation with the use 
of MCMC. 

The remainder of the section focuses on the use of marginal PMCMC to sample efficiently in cases where 



9 and x are highly dependant a posteriori. In section 2.2 we introduce PMCMC and describe how to apply 
it to MRFs, using the exchange algorithm to avoid the intractable normalising constant. The key component 
of a PMCMC algorithm is an efficient SMC sampler for simulating from p{x\9, y) and we describe candidate 



approaches in section 2.3 



2.1 Intractable Normalising Constant 

Consider the use of a Metropolis-Hastings (MH) update on the full conditional p{9\x, y) ex p{9)f{x, y\9). Let 



tM 



z{e) 



^{x,y\9). Using a proposal q{9*\9), we obtain the 



lixMO) = Y{T=i^j{x,y^ Cj\9) so that f{x,y\9) 

acceptance probability: 

p{9*)^{xM9*)q{9\9*) Z{9) 

p{9UxMO)q{0*\e) z[9*y ^'' 

which cannot be calculated due to the presence of Z{.), leading such models to be known as doubly intractable 



(Murray et al. 2006 ). A similar problem is encountered when using maximum likelihood, or any other method 



that need to evaluate the likelihood. There are several different classes of approaches for avoiding this: 



pseudo-likelihood (Besag 19751; variational approximations (Murray and Ghahramani 20041; Monte Carlo 



approximations (Geyer and Thompson 1992 Green and Richardson 2002 Atchade et al. 20081; auxiliary 



variable approaches ( |M0ller et a!] |2006[ [Murray etld] |2006[ ); and ABC ( [Grelaud efai] |2009[ |. Each of the 
methods results in targeting an approximation to p{9\x, y), unless exact simulation from /(x, y\9) is available. 
We focus on auxiliary variable methods and ABC due to their ease of implementation and because we find 
that using an MCMC algorithm in place of exact simulation from f{x\9) has little practical effect on the 
examples we consider. 



2.1.1 The Single Auxiliary Variable Method 

The first MCMC method that targets the true posterior in the presence of the intractable normalising 



constant was the single auxiliary variable method (SAVM), which originated in M0ller et al. (20041 and 



M0ller et al. (20061. Here we present an alternative justification of SAVM to that in the original paper 



(given in Andrieu et al. (2010 1), based on the principle that it is possible to design an MCMC algorithm for 
simulating from some target Tr{9) when only unbiased positive estimates Tf{9) of an unnormalised version of 



IT are available. Specifically, a standard Metropolis-Hastings algorithm that targets tt, with proposal q{d*\0) 

and acceptance probability 

9{9*)q{9\9*) 

has a stationary distribution of Tr{9). Recall now the MH algorithm targeting p{6\x,y) with acceptance 
probability given in equation (l7|: equation (Is]) implies that a Metropolis-Hastings algorithm targeting an 
unbiased estimate p{9\x,y) :— p{9)j{x,y\9)l/Z{9) of p{9\x,y), where 1/Z{9) is an unbiased estimate of 
1/Z{9), will have a stationary distribution oi p{9\x,y). 

SAVM targets a joint distribution Tr{9,u\x,y) oc q-a{u\9,x,y)p{9)j{x,y\9)/Z{9), of which p{9\x,y) is a 
marginal, constructed through the introduction of an auxiliary variable u, where q-[]i{.\9 , x , y) is an arbitrary 
distribution and u — {ux^Uy) £ X 'xY . This target is then simulated using a Metropolis-Hastings algorithm 
on the {9,u) space, with proposal 6* ^ q{-\9) and u* ^ f{-\9*) (the simulation of u* being performed using 
exact sampling). The acceptance probability for this move is given by: 

. . p{9*h{x,y\9*)q{9\9*)quiu*\9*,x,y)jiu\9) 

p{9)j{x,y\9)q{9*\9)qu{u\9,x,yh{u*\9*) ' ^ > 

Comparing equations (|7| and (l9l we see that essentially SAVM is using single point importance sampling 
(IS) estimates of 17^(6') and 1/Z{9*), where T/Z{9) = q^{u\9,x,y)/-f{u\9) and 1/Z{9*) = qu{u*\9* ,x,y)/jiu*\9*). 
Linking this to the argument based on equation (Isl gives us an alternative justification of the algorithm which 
suggests the following generalisation: any algorithm that finds an unbiased estimate of 1/Z{9) may be used 
in place of IS. For example, an SMC sampler with tti = f{.\9) and ttt — qu{u\9,x,y) will provide such an 
estimate whose efficiency is less dependent on the choice of g„ than is the IS estimate. Note that SMC 
samplers play an important role later in the paper, and we refer the reader unfamiliar with the method to 



Del Moral et al. (20061 for further details 



2.1.2 The Exchange Algorithm 



In Murray et al.\ (20061, the exchange algorithm is initially motivated by the observation that in SAVM 



the ratio Z{9)/Z{9*) is estimated indirectly, using the ratio of l/Z{9*) and 1/^(6'). [Murray et a?.| ( |2006 l 
suggests that estimating Z{9)/Z{9*) directly may lead to an improved algorithm. The resulting algorithm is 
actually most clearly formulated (for our purposes) in an alternative fashion. Consider the target distribution 
t:{9,9* ,u\x,y) = f{x,y\9)p{9)q{9*\9)f{u\9*)/p{x,y) of which Tr{9\x,y) is a marginal, where u — {ux,Uy) e 
X X y. At iteration z, the exchange algorithm iterates the following two operations (given input {9{i — 
l),0*(i — l),u(j — 1)) = {9,9*,u) from the previous iteration), giving an MCMC algorithm that targets 
TT{9,9*,u\x,y): 

1. Draw 9* - q{.\9{i - 1)) and u - /(.|r). 

2. Let {9{i),9*{i),u{i)) ^ {9* ,9{i - l),u) with probability: 

Pi9*h{x,y\9*)qi9\9*h{u\9) Z{9)Z{9*) ^ p{9*)^{x,y\d*)md*)l{u\9) 
p{9h{xM0)q{e*\9h{u\9*) Z{9*)Z{9) p{9UxM0)q{e*\9h{u\9*) ' ^ "^ 

otherwise set (9{i),9*{i),u{i)) = {9{i - l),9*{i - l),u). 



This deterministic move simply proposes a swap of and 6*, hence the name "exchange algorithm". The 



proof that this is the appropriate acceptance probability for this move is given in Tierney (19981, where 
it is established that the acceptance probability for a deterministic move 9 — >■ T{9) on a target tt is 1 A 
TT{T{9))/Tr{6). If we now compare equation Q with equation (11 1 we see that 'y{u\9)/^{u\6*) can be thought 
of as an IS estimate of Z{9)/Z{9*). so that the exchange method can be interpreted as using an estimate 
of the acceptance probability. However, we note that in this case we only have proof that this estimate in 
particular results in an algorithm that targets the correct distribution - alternative estimates that give a 



better estimate of this ratio may not be appropriate (although the similar approach of Koskinen ( 2008 1 is 
also shown to have the correct target). The exchange method exhibits superior performance to SAVM in 



Murray et al. (2006), and it is also easier to implement - in SAVM there are more algorithmic choices to 



make and these can severely affect the performance of the algorithm. For these reasons we focus on the 
exchange algorithm for the remainder of the paper, although we note that if the free choices in SAVM are 



well made (as in M0ller et al. (2004l), it may outperform the exchange method. 



Although the exchange algorithm targets the true posterior, it is less efficient than a standard MH algo- 
rithm would be if the true target were available. This inefficiency is accentuated if the proposed u ~ f{-\9*) 



has only a small probability of being generated under 9. Murray et al. (20061 describes an extended ex- 



change method to improve this inefficiency whilst maintaining the exactness of the algorithm, using annealed 
IS ( Neal[[200T l as a substitute for the importance estimate ^{u\9)/j{u\9*). In the applications described in 
section H] we found the use of this extended method to be essential in obtaining a reasonable performance 
for the methods described in the paper. The central idea of the extended method is to move the proposed 
u ~ f{-\9*) through a sequence of transitions so that it has a larger probability of being generated under 9 (a 
similar idea may be used to improve SAVM in a similar way) . This is done by introducing a sequence of K tar- 
get distributions fki-\9,9*) oc -fki.\9,9*) = -/{.\9*)f^>'-/{.\9y-^'' where, for example, (3k = (K - k + l)/{K +1) 
so that the sequence of targets provides a "route" from f{.\9*) to f{.\9). The extended exchange algorithm 
moves the initial point u ^ f{-\9*) via a sequence of transitions Rk{u'\u, 9, 9*) that satisfy detailed balance 
with fk{u'\9,9*), and changes the acceptance probability accordingly: 

1. Draw 9* - q{.\9{i - 1)) and uq - fi.\9*). 

2. Apply the sequence of transitions: ui ^ Ri{ui\uq, 9, 9*), U2 ^ R2{u2\ui, 9, 9*), ... , uk ^ RKiuK\uK-i, 9, 9*). 

3. Let {9{i),9*{i),u{i)) = {9*,9{i - 1),uk) with probability: 



lA 



p{9*h{x,y\9*)qi 



^k+i{uk\9,9*) 
p{9hix,y\9)q{9*\9) ^J-^ jkiuk\9,9*) ' 



K 

n 



(11) 



otherwise set {9{i),9*(i),u{i)) = {9{i -l),9*(i ~1),uk)- 



For our applications we have the factorisation /(x, j/|0) = f{x\9)g{y\x,9), where f{x\9) = ^{x\9)/Z{9) 
has an intractable normalising constant and g{y\x, 9) is normalised. In this situation the exchange algorithm 
is simplified slightly since Uy does not need to be simulated. 

2.1.3 The Exchange Algorithm Without Exact Simulation 

In step 1 of the exchange algorithm, on sampling u, exact simulation from the likelihood is required. However, 
aside from a few special cases (one of which is the Ising model, which can be sampled exactly using "coupling 



from the past" (Propp and Wilson 19961) this is not generally possible for MRFs. Caimo and Friel (20111 
choose to approximate the exact simulation by sampling u from f{.\6*) using an MCMC run that is "long 
enough" to get a point that can be treated as if it were simulated exactly from f{.\6*). We refer to this 
approach as the approximate exchange algorithm, and use M to denote the number of iterations of the 



MCMC algorithm used to simulate approximately from the likelihood. Caimo and Friel (2011) suggest that 



500 iterations is a long enough run for models similar to those studied in this paper, a conclusion supported 
by own study which suggests that as few as 50 or 100 iterations are usually sufficient. In appendix B in 
the supplemental materials we give theoretical justification for the validity of this approach, proving that 
(using a similar method to a proof in Andrieu and Roberts ( 2009[ )) when the MCMC kernel for the exact 
exchange algorithm is uniformly ergodic, the invariant distribution (when it exists) of the corresponding 
approximate exchange algorithm becomes closer to the "true" target (that of the exact exchange algorithm) 
with increasing M. We also characterise the rate of convergence of the approximate kernel. The same proof 
justifies the use of an MCMC kernel as a substitute for simulating exactly from the likelihood within SAVM. 



2.2 Marginal PMCMC 

2.2.1 The Pseudo- Marginal Approach 



Our primary concern in this section is problem 3 in section |1.3.1| that the data augmentation approach to 
obtaining a sample from p{0\y) is inefficient when 9 and x are highly dependant a posteriori. The phase 
change in Ising models means that this dependence is particularly clear in these models, but other MRFs 
(including ERGMs (Snijders 2002[)) also have this property. Beaumont (20031 and Andrieu and Roberts 



(2009) describe an alternative approach to sampling from p{9\y) that is designed to avoid the problems 
caused by this dependence: approximating the "ideal" algorithm that uses an MH update by replacing p{9\y) 



with unbiased estimates of the form p{9\y) to p{9\y). The argument described in section 2.1.1 tells us that 



such updates actually provide us with points from the desired target p(9\y). Such an approach is referred to 
as a pseudo-marginal approach. The efficiency of the MCMC chain based on these updates depends on the 
variance of the estimator p{9\y). 

The simplest useful estimator p{9\y) is an IS approximation p'^ i9\y) = j^^k^iP{9,x'^^'^\y)/q{x'^^''\6), 
where x^^> ^ ^(.l^). However, for the applications about which we are interested, where x is high-dimensional, 
it is difficult to define a proposal distribution q that results in an estimator with a small variance. The optimal 



proposal is p{x\6, y) (see (Geyer , 2011 ), for example): which we cannot sample from directly. The remainder 



of this section is devoted to using SMC samplers for this task. The framework of PMCMC in [Andrieu et al.\ 



(|2010|) then tells us, via the pseudo-marginal approach, how to use SMC samplers to more efficiently obtain 

we 



samples from p{6\y). In section 2.2.2 we describe the marginal PMCMC framework, then in section 2.2.3 
describe the use of marginal PMCMC on MRFs. 



2.2.2 Marginal PMCMC 



As in Andrieu et al. ( 2010 1 , in this section we make the assumption (relaxed in the following section) 
that the normalising constant Z of f{x, y\0) is independent of 9 so that the joint posterior is given by: 
p{9,x\y) = p{9)'y{x,y\9)/Zp{y). We describe the marginal PMCMC algorithm briefly here - for a thorough 
description see Andrieu et QZ.|p010[ ). The algorithm is an MCMC sampler that targets p{9,x\y), operating 
on the factorisation p(9\y)p{x\9 , y) . The intuition behind the approach is in devising a move on the joint 
{9,x) space: in terms of moving around the 9 space it would be most efficient if the proposal could take 



9 



the form q{9* \9)p{x* \d* , y) , so that x* is "perfectly adapted" to the proposed 9*. The approach is to devise 
an algorithm that approximates this idealised situation, using an SMC sampler as a statistically efficient 
method for simulating from an approximation p{x*\9* ,y) to p{x*\9*,y). The point from p(x*\9*,y) is then 



used as a proposed point in a Metropolis-Hastings algorithm that targets p{x*\9*,y). Andrieu et al. (20101 
derive the acceptance probability that should be used by explicitly writing down the target and proposal 
distributions involved. The resulting algorithm in the case we describe here proceeds as follows. 

At iteration i, beginning with the point {9{i — l),x{i — 1)) and the estimate 0(i — 1) outputted from 
iteration i — 1: 

1. Simulate 9* ^ q{.\9{i — 1)). where q is some proposal distribution. 

2. Run an SMC sampler on the x space, with the final (unnormalised) distribution as t:t{x) = "/{x,y\9*). 
This gives a particle approximation p{x\9* , y) to p{x\9* , y) and an estimate 0(0*, y) of its normalising 
constant, (t){9* ,y) = Zp{y\9*). 

3. Sample a single point x* from p{.\9* ,y). 

4. Set {9{i),x{i),'${i)) = {9* ,x* ,'${9* ,y)) with probability 

p{9*)^i9*,y)q{9{t-l)\9*) 

p(9)^^-l)q{0*m-l)y ^ ' 

otherwise set {9{i),x{i), (j>{i)) — {9{i — l),a;(i — 1), (f){i — 1)). 

Here note the interpretation of the algorithm as a Metropolis-Hastings algorithm targeting an unbiased 
approximation to p{9\y) ex p{9)(j)[9^y) = p{9)Z J f{x,y\9)dx. 

2.2.3 Exchange Marginal PMCMC Algorithm 

Now consider the application of marginal PMCMC to cases when the normalising constant of f{x\9) is a 
function of 9, and cannot be evaluated. In this case the joint posterior is p{9,x\y) = p{9)^{x,y\9)/Z{9)p{y) 
and direct application of the marginal PMCMC algorithm results in the presence of the intractable term 
Z{9)/Z{9*) in the acceptance ratio. The combination of marginal PMCMC with the exchange algorithm res- 
ults in the disappearance of this ratio. Let us introduce the target 7t{9,9* ,u\y) — p{9)p{y\9)q{9*\9)'^{u\9*)/Z{9*)p{y), 



where u — {ux, Uy), as in section 2.1.1 We then use the exchange algorithm as follows. 



1. Draw 9* - q{.\9{i - 1)) and u - fi.\9*). 

2. Run an SMC sampler on the x space, with the final (unnormalised) distribution as nxix) = "f{x,y\9*) 
in order to obtain the particle approximation p{x\9*,y) and an estimate (f>{9*,y) of its normalising 
constant, (j)i9*,y) = Z{9*)p{y\9*). 

3. Sample a single point x* from p{.\9* ,y). 

4. Let {9{i),9* {i),x{i),'${i),u{i)) = {9*,9{i~ l),x* ,(^{9* ,y),u) with probability: 

^^^^ Pi9*)^i9*,y)q{9\9*h{u\9) Z{9)Z{9*) ^ ^ ^^^ p{9*)l{9* ,y)q{9\9*)^{u\9) 
p{9)^{i-l)q{9*\9)-i{u\9*)Z{9*)Z{9) p{9)l{i - l)q{9*\9)-i{u\9*y 

otherwise set {9{i),9*{i),x{i),(j){i),u{i)) = {9{i — 1) , 9* {i — 1) , x{i — 1), <?!)(« — l),u). 
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The acceptance probability in equation ( 13 1 is again derived using a proof similar to those in Andrieu et al. 



( 2010 1, in conjunction with the result on deterministic transformations, from Tierney ( 1998 ), used previously. 



This proof can be found in appendix A in the supplemental materials. An extension using the extended 



exchange algorithm described in section 2.1.2 is trivial. The results in [Andrieu et al. (2010) also tell us that 
the {9,x) points that are generated are from the joint posterior p{9,x\y), and that the unused SMC points 
generated from p{.\9* ,y) can be recycled in Monte Carlo estimates based on the x space. 

Our framework is now in place: the EMPMCMC algorithm addresses each of the issues raised in section 
|1.3.1| The efficiency of the algorithm is heavily dependant on the design of the SMC sampler on the x space, 
and we examine this issue in the following section. 



2.3 SMC Samplers for MRFs 

The use of SMC samplers for simulating from hidden Markov models is well understood, but there have 
been few attempts to use them on more general graphical models. The exception is the work of |Hamze| 



and de Freitas ( 2005 1 , which they describe in application to discrete or Gaussian models with a pairwise 



factorisation. It is their methods, hot coupling and tempering, that form the basis of our approach. The 
most fundamental choice in the design of such an algorithm is the choice of targets 7ri:jv to use, where the 
first target is easy to simulate from, the final target is the desired distribution and the targets between these 
two provide a "route" from the first target to the last. It is this choice, rather than matters such as designing 
the forward and backward kernels that we focus on here. 

Hot coupling proceeds by setting tti in the SMC sampler to be a spanning tree of the true graph (which 
can be easily sampled, and whose normalising constant may be exactly calculated using [Carter and Kohn] 



(1994)) and then to add edges to the graph, one at a time (randomly chosen), until the true graph is 



reached at ttn (this scheme could be generalised to non-pairwise MRFs by forming larger cliques as the 
SMC sampler progresses). Tempering consists of choosing the sequence 7r„ = /(a;, y|0)-^/*", where {tn)n=i 
is a decreasing sequence of "temperatures" with tjy — 1. For an Ising model this sequence of distributions 
has a simple interpretation, since we obtain /(x, y|^2;,^j^)^'*" oc f{x,y\dx/tn,0y/tn) 



Hamze and de Freitas 



(2005) suggest that hot coupling is often more effective: our own empirical results support this, as long as 



data generated from the initial tree is a good approximation to data generated from the final target. This 
is difficult to quantify in advance in general: the case of the Ising model illustrates this, where the use of 
a square lattice exhibits the phase change described previously. Our empirical investigation was based on 
results obtained in our two applications in section |4j In the Ising model in the first application, we observed 
that data generated from an initial tree had similar characteristics to that that from the full grid (although 
this may not hold as strongly for larger grids). In the social network application that follows, we did not 
find that there was a smooth transition between the initial tree and the final MRF. A possible cause is that 



in this case the latent MRF has the circular structure exhibited in Frank and Strauss (19861, and that the 



addition of edges that complete this circle change the character of the data drawn from the graph. For this 
reason, the tempering method was preferred in this latter application. The tempering method is likely to 
perform more reliably across a range of applications (it is also regularly used in non-MRF applications of 
SMC samplers), but for some MRFs hot coupling can be a useful tool. 
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3 APPROXIMATE BAYESIAN COMPUTATION 



3.1 Introduction 



Approximate Bayesian computation (ABC) is a method designed, by Pritchard et al. (19991, to perform 



Bayesian inference in cases where a hkehhood l(jj\0) cannot be evaluated because it is unavailable or compu- 
tationally intractable. The basic idea is to approximate the likelihood by le{y\d) — J , l{y'\0)n^{y'\y)dy where 
Tr^{y'\y) is a probability density on the same space as y, and centred on y with a concentration determined by 
e. If e = 0, then le{y\6) is equal to the true likelihood l{y\0). This approximate likelihood can be evaluated 
by a Monte Carlo approximation le{y\0) ~ 4 X]r=i ^c(y'^'^''ly) where y''^'') ^ '(-1^) ^^nd R > 0. This likelihood 
gives more weight to values of 9 that are more likely to generate data similar to y. 

In practice, due to the possible high dimension of the data, a summary statistic S{y) is usually used 
in place of the data, giving a further approximation (if the statistic is not sufficient) to the true likelihood 



using: 



k^siy\0) := US{y)\9) := / l{S{y')\0)7r,{S{y')\Siy))dy\ 

Jy' 



(14) 



MCMC (Marjoram et al. 20031 and SMC samplers (Sisson et al. 2007 Del Moral et al. 2011 Robert 



et al. 2011 1 have been introduced to enable efficient exploration of the 9 space when using an ABC ap- 



proximation to the likelihood. In section |4] we apply an ABC-SMC sampler to the problem of parameter 
estimation in MRFs. The main advantage of this approach is that it avoids the difficulties listed in section 

Mm 

For some of the models we consider we can take e = 0, and low-dimensional sufficient statistics sometimes 
exist: in these cases it can be possible to devise an ABC algorithm that targets the correct posterior 
distribution p{9\y). However this is not the case in general. For higher dimensional sufficient statistics it 
can require more computational effort to obtain a useful sample when e = 0. In some cases using e = is 
impracticable, and sufficient statistics are not available, and in these cases it is only possible to obtain a 
sample from an approximation to the true posterior with ABC. Usually the use of ABC methods involves a 
trade off between the degree of approximation to the true posterior and the computational effort required 
to obtain the sample. Thus tuning ABC can be difficult and it is not easy to quantify the effect of the 
approximation to the true posterior that is used. The intricacies of using ABC are discussed further in the 



review of Marin et al. (20111. 



3.2 Application to MRFs 



ABC has previously been applied to inferring the parameters of MRFs ( Grelaud et al. 2009 1 - here we instead 



consider noisy or incomplete data. It is particularly simple to apply ABC to the models that we focus on in 
this paper, since both Ising models and ERGMs are defined in terms of statistics of the data. However, as 
is the case when using the exchange algorithm on these problems, it is not possible to exactly simulate from 



l{.\9) and we consider the effect of using for example, MCMC (as in Grelaud et al. (2009)) as a substitute 



The proof in appendix B in the supplemental materials describes the effect of using an MCMC run of length 
M for simulating from l{.\9) within an ABC-MCMC algorithm (as in Marjoram et al. (20031). The result is 
analogous to that obtained for the exchange algorithm and SAVM: the invariant distribution (when it exists) 
of this method becomes closer to the "true" target (the invariant distribution of the standard ABC-MCMC 
algorithm) with increasing M. 
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In our models f{x, y\d) factorises as f{x, y\d) = f{x\9)g{y\x, 9). For a given 6 the simulation y' from the 
likelihood l{-\0) is performed through first simulating x' from f{.\9), then by simulating y' from g{-\x' , 6). In 
this situation we note that x' is always proposed from its prior f{x\9), as opposed to the posterior p{x\9, y), 
therefore when the effect of the data dominates that of the prior the exploration of the x space is not likely 
to be as efficient as that used as in the PMCMC approach described in the previous section. 



4 APPLICATIONS 



4.1 Methods 



In this section we apply each of the methods described in the paper to two different data sets. The configur- 
ation of the algorithms that we use has some commonality between each case, thus we begin by describing 
these common aspects of the algorithms before describing the specifics of their application in the relevant sec- 
tions. Our implementation is in Matlab, and makes use of the UGM package (available from Mark Schmidt's 
website at http://www.di.ens.fr/~mschmidt/Software/UGM.html). 

4.1.1 Approximate Bayesian Computation 

The choices made in constructing the approximate likelihood in our ABC algorithms are always the same, 
up to the choice of summary statistics: we use the uniform kernel T:^{S{y')\S{y)) oc '^\\s(y')-s(y)\\<e £^s the 
data comparison function, where ||.|| is the Euclidean norm. 

Our choice of the uniform kernel allows us to use the adaptive ABC-SMC sampler of |Del Moral et oT] 



(2011 ) to generate weighted points from the posterior. This method adaptively chooses a sequence of targets 



in the SMC sampler, where the n'th target is the ABC posterior using the likelihood in equation (14 1 with 
tolerance e„. The sequence (e„) is chosen by, for each n, taking the smallest tolerance that ensures that a 
particular percentage of the particles is given non-zero weight. We always use 10000 particles, initialise the 
ABC-SMC with ei = 20 and terminate it when e„ is reduced to zero, and resample at every iteration (which 
is likely to be the most efficient option when using this ABC algorithm, since otherwise particles with zero 
weight are carried from one iteration of the sampler to the next) using stratified resampling. This method 
uses an ABC-MCMC move as the forward kernel within the SMC sampler, and we choose a MH move with 
a random walk proposal. 

4.1.2 DA and PMCMC 

In our implementation of DA, the update of 9 uses an MH step whose proposal g is a random walk with 
variance si, where s is some problem specific scaling and / is the identity matrix of the appropriate dimension. 
To update x we use L sweeps of the single site Gibbs sampler. 

Our configuration of the marginal PMCMC algorithm is chosen to facilitate easy comparison with the 
DA approach above. The forward kernel in the SMC sampler is always chosen to be the single site Gibbs 
sampler, and stratified resampling was performed when the effective sample size (ESS) dropped below 0.5 
multiplied by the number of particles. To ensure that the computational effort of a PMCMC iteration is 
comparable to a sweep of the DA algorithm, we take the number of particles P = YL/T\ , where T is the 
number of targets used in the SMC sampler within the PMCMC. In every case, our figures show 4500 points 
generated by the algorithm, after a burn in of 500 iterations. 
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4.1.3 Simulation From the Likelihood and Exchange Algorithm 



The extended exchange algorithm (section 2.1.31 is used when updating 6 in both the DA and PMCMC 
algorithms; B bridging targets are used. Both this use of the exchange algorithm and the ABC approach 
require simulation from the likelihood f{x\9). In all cases we use 1000 sweeps of the single site Gibbs sampler 
described in section |1.3.2| Note that the proof in appendix B in the supplemental materials indicates that 
there is no restriction on choice of the initial point xq of this Gibbs sampler. We simply chose xq = y (which 
we are free to do since in this case x and y inhabit the same space). This choice, where the same initial value 
is used at each iteration, could potentially be improved by choosing the initial value based on the previous 
run of the Gibbs sampler, however this has little practical effect in our applications. Changing the initial 
value for each run of the Gibbs sampler necessitates an alteration to the proof in the appendix, and this 
alteration is described in the appendix. 

We note that all the methods contain the single site Gibbs sampler (either for simulating from the 
likelihood, or for updating x), which we know to be inefficient. More efficient MCMC moves could be used 



in its place, for example the Swendsen-Wang algorithm for Ising models (Higdon 19981 or, for the ERGM 



example in the next section, the "tie no tie" sampler used in Caimo and Friel (20111. This would improve 
the efficiency of all of these algorithms, and for real applications we would advocate such an approach. Here 
we have chosen not to investigate these potential efficiency gains, concentrating instead on the impact of 
avoiding the use of the DA algorithm. 

4.1.4 Reporting of Results 

The difference in performance of the algorithms in both applications is large and can be observed through 
visualising the samples that are generated. Estimates of posterior means and variances do not provide 
any useful information above plots of the samples, and measures of the efficiency of the MCMC such as 
autocorrelation times are not informative since some of the chains we wish to compare are not close to 
stationarity. Thus we chose to only represent the output of the samplers through plots of the samples they 
produce (in the ABC-SMC algorithm the positions of the equally weighted particles after resampling are 
shown). 

4.2 Ising Model 

In this section we apply the methods described in the paper to inference of the parameters of an Ising model. 
We use noisy observations of a hidden 10 x 10 pixel two-dimensional grid, simulated from the model in 
equation Q with parameters 0^ = 0.1 and 6y — 0.1. This data is shown in figure la Our prior over both 9^ 



and 6y is uniform on the interval [0, 3], and our model for the data is given by equation Q. Our goal is to 
simulate from the posterior p{dxi 0y\y)j ^-^d in this section we compare ABC, DA and PMCMC algorithms. 
Our ABC algorithm used two summary statistics of the data: Si{y) = ^(j j)eTSiyiy3 (^^^ number of 
equivalently valued neighbours) and S2{y) = J^iVi (tti6 magnetisation). In the adaptive ABC-SMC al- 
gorithm, we choose that 70% of the particles were given non-zero weight at each iteration. Points from 



p{dx,9y\y) using this method, which was terminated at n — 12, are shown in figure lb the posterior mass 
is distributed around the areas where 0^ is small and/or 0y is small. It is clear that this posterior will pose 
a significant challenge to DA: in order to explore the posterior fully it is necessary to sample both large and 
small values of 0^, and moving from one side of the critical value of 9x to the other is difficult due to the 
dependence between x and 0x. Points from p{9x:(^y\y) using DA, in which we chose s = 1 and L = 10'°^, 
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are shown in figure Ic The posterior dependency between 6^ and x, and the inefficiency of the updates on 
p{x\9^y) (even when using 10^ sweeps of the Gibbs sampler at every iteration), inhibit the sampler from 
moving below the critical value of O^ . 

In this implementation of the marginal PMCMC algorithm, we use hot coupling for the SMC sampler 
updates, adding a single edge at each target (this results in a total of 82 targets, giving approximately 1200 
particles). The simlated points are shown in figure [ld| the sampler has explored the whole posterior, with no 



evidence of difficulty in passing the critical value of Ox- Figure le shows the trace plot of Si{x) for both the 
DA and the PMCMC methods, illustrating that the PMCMC approach enables exploration of the x space, 
whereas DA only explores graphs that correspond to a large value of O^- In both the DA and PMCMC 
extended exchange algorithms, we use B = 100 intermediate targets in the annealed IS: fewer intermediate 
targets can result in poor estimation of the ratio of normalising constants, and therefore inefficiency in both 
algorithms. If the original exchange algorithm is used (_B = 1), the results from the PMCMC method do not 
look dissimilar to those produced by the DA method: when there is a proposed change in 6^ that crosses the 
critical value, the move is rejected since the proposed u ^ f{.\6*) has a small probability of being generated 
under 6. 

The dominating factor in the computational cost of each algorithm is in the Gibbs sampler for the 
simulation from the x space (with the target of either p{x\6) or p{x\6,y)). For each iteration of the DA or 
PMCMC algorithm 10^ sweeps of the Gibbs sampler are performed at each iteration of the MCMC (along 
with an additional 10^ for the exchange algorithm), so the total cost of our run is approximately 5 x 10* 
sweeps. For each target of the ABC-SMC 10^ sweeps are performed, so the total computational cost is 

1.2 X 10* sweeps. 

4.3 Social Network Data 

We consider the application of our methods to the Florentine family business graph studied in |Caimo and| 



Friel (20111), shown in figure 2a We begin by assuming the network x is directly observed and use exactly 



the model for the data as that used in Caimo and Friel (20111: using an ERGM (equation ([5|) with 



Si{x) — X]i<i -"^ij (tl^6 number of edges) and S2{x) — X]i<j<fc ^ikXjk (the number of 2-stars) , and prior on 
e., = [91,92) as 9., ^ UiQ.ZQh). 



We begin by examining ABC as a direct alternative to the MCMC approach inlCaimo and Friel (20111 



(the DA and PMCMC approaches are not applicable here since there is no latent space). We use the statistics 
Si and 5*2 as our summary of the data, and since these statistics are sufficient, when e = the ABC posterior 
is equivalent to the true posterior. In this implementation the scale of the target changes dramatically across 
the iterations (since the posterior is significantly tighter than the prior), thus we adaptively choose the 
proposal variance in the MH move, so that at iteration n, the variance is 2E„_i where S„_i is the sample 
variance of the particles at iteration n — 1 (as in Robert et al. (2011[)). We choose that 50% of the particles 



were given non-zero weight at each iteration. The ABC-SMC terminated at n = 16, and weighted points 



drawn from p{9i, 92\y) using this method are shown in figure 2b These points are in good agreement with 



the shape of the posterior shown in Caimo and Friel (20111. We note that the "population" nature of the 



SMC algorithm acts as a substitute for the population MCMC method employed in that paper. 

Now consider the case where it is assumed that the observed graph, now denoted by y, is a noisy 
observation of some underlying graph x. Specifically, we use the model in equation (pi, which account for 
noisy observations of the edges of the underlying graph. We use a generalisation of the model described 
above, defined on the extended parameter 9 = {9i,92,9y) with the same priors on 9i and 6*2, and with a 
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Figure 1: Results for Bayesian parameter estimation in a hidden Ising model. 



(a) Observed data. 



(b) Points from the posterior produced (c) Points simulated from the posterior 




by the ABC-SMC algorithm. 
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using the DA exchange algorithm. 



(d) Points simulated from the posterior 

using the exchange marginal PMCMC (e) Trace plot of the statistic Si{x), the number of identical neighbours in the hidden 

algorithm. field x, for the DA (black) and the PMCMC (grey) algorithms. 
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prior of 9y ~ U[0, 100] on the additional parameter 9y. Note that the data itself does not suggest that this 
model is particularly suitable: it is chosen simply to highlight the computational problems that can result in 
the presence of a latent ERGM. We use the same ABC-SMC algorithm as above, using the same summary 
statistics (which are now not sufficient). The ABC-SMC again terminated at n = 16, and weighted points 



simulated from p(0i, 92\y) using this method are shown in figure 2c One of the limitations of ABC is evident 



here: since the statistics are not sufficient, it is difficult to assess how accurate the approximation to the 
true posterior is, even though we have e = 0. We applied the DA algorithm and PMCMC algorithms to 
the same model that assumes the data is noisy (including the parameter 9y). In the DA algorithm we chose 
s = 10 and L = 10'* and points generated from p{9i,d2\y) are shown in figure 2d The posterior sample 



produced by this sampler was highly dependant on the initial point. In this particular example, when the 
initial value of x is the complete graph (where all edges are present), the chain gets stuck in this state, which 
is highly correlated a posteriori with a small value of 9y. Again, the inefficiency of the updates on x and the 
dependency between x and 9 is seen to lead to the poor performance of the DA approach. 

In addition to comparing the PMCMC approach to DA, we also compare the original IS based pseudo- 
marginal approach to the more general PMCMC approach in order to observe the benefit of using an SMC 
sampler in sampling the x space. Specifically, we compare IS with 10"' importance points drawn using a 
uniform distribution independently on each edge, with the SMC sampler with tempering using 1000 targets 
and 10 particles. Figures [2e| and [2f| respectively illustrate the PMCMC results using these two schemes. The 
performance of the IS based technique is poor since the importance proposal is poor and does not provide 
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an accurate estimate of the normalising constant Z{9)p{y\6). The performance of the tempering SMC based 
EMPMCMC is significantly better than both the IS and DA approaches, with convergence to the region 
shown in figure |2| regardless how the MCMC is initialised. This lack of dependence on initial conditions, 
and the free exploration of the x and 9y space that is allowed by the method, provide some evidence that 
this method has found the true posterior. 

In the DA and PMCMC extended exchange algorithms, we use 1000 intermediate targets in the extended 
exchange algorithm: again we find that fewer intermediate targets results in poor estimation of the ratio of 
normalising constants and thus inefficient MCMC algorithms. 

Again, the dominating factor in the computational cost of each algorithm is in the Gibbs sampler for the 
simulation from the x space. In this case, for each iteration of the DA or PMCMC algorithm 10** sweeps 
of the Gibbs sampler are performed at each iteration of the MCMC (along with an additional 10^ for the 
exchange algorithm), so the total cost of our run is approximately 5 x 10^ sweeps. For each target of the 
ABC-SMC 10^ sweeps are performed, so the total computational cost is 1.6 x 10* sweeps. The method of 



Caimo and Friel (20111 is relatively cheap compared to ABC-SMC, with the only simulation from the x 
space being carried out as part of the exchange algorithm (which costs 10'^ sweeps). 



Figure 2: Results for Bayesian parameter estimation in a hidden ERGM. 



(a) Observed data. 



(b) Points from the posterior (with 
no observation error) produced by the 
ABC-SMC algorithm. 



(c) Points from the posterior produced 
by the ABC-SMC algorithm. 
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(e) Importance sampling based pseudo- (f) Points simulated from the posterior 
(d) Points simulated from the posterior marginal approach, with the exchange using the exchange marginal PMCMC 
using the DA exchange algorithm. algorithm. algorithm containing an SMC sampler. 
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5 DISCUSSION 

Bayesian parameter estimation for latent MRFs can face computational difficulties using standard meth- 
odology since the DA approach can be extremely inefficient. We have described two methods, ABC and 
PMCMC that can offer an alternative in situations where DA is not suitable. 

ABC is currently particularly popular for addressing missing data problems, and we apply it for the first 
time to inference in ERGMs as a method for avoiding the intractable normalising constant. We also provide 
a theoretical justification for the use of MCMC for inexact simulation from the likelihood, as is required in 
many MRF models, within ABC-MCMC. However, we observe the usual limitations of ABC in cases where 
sufhcient statistics are not available: namely that an approximation that is difficult to quantify is introduced. 

Marginal PMCMC offers an effective means of bypassing the potential inefficiency of DA and our results 
indicate that this method is a promising approach to parameter estimation in these models. We note 
the large computational cost of these methods, especially when sampling from a high dimensional latent 
space. As such, routine use of the method will usually only be possible when accompanied by an efficient 
implementation, possibly on parallel computing architectures such as graphics cards or cloud computing 
resources. Given the increase in popularity of this hardware, the PMCMC methodology offers a promising 
avenue for use on more realistic applications in the near future. The key aspect of PMCMC is the use of 
an SMC sampler for sampling from the x space. Although we have focussed on parameter estimation, our 
results (and those of Hamze and de Freitas (2005|)) indicate that SMC samplers have an important role to 



play in simulating from MRFs, both when the parameters are known and unknown. These methods have not 
been used before in the ERGM literature, and address some of the known problems with MCMC described 
in 



Snijders (20021 



We follow previous work in using the exchange algorithm to account for the intractable normalising 
constant, and give theoretical justification for the use of approximate exchange algorithms where MCMC as 
a substitute for exact simulation from the likelihood. In application, we find the use of the extended version 



of the exchange algorithm described in Murray et al. ( 2006 ) to be essential to ensure the MCMC can move 
freely in the problems we consider. 
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Appendix A: Target Distribution of Exchange Marginal PMCMC 

This appendix establishes that the EMPMCMC algorithm in section 2.2.3 of the paper has the desired target 
density of p{9\y). The proof requires the description of the extended target and proposal distribution used 
as a consequence of the use of the SMC sampler within the algorithm. 

For ease of exposition, we express the algorithm in a slightly different form to that in the main text, 
including the prior p{6) within the SMC sampler. The i'th iteration of the algorithm is then: 

1. Draw e* - q{.\e{i - 1)) and u - /(.jr). 

2. Run an SMC sampler on the x space, with the final (unnormalised) distribution as ttt{x) ~ p{9)j{x, y\9*) 

in order to obtain the particle approximation p( a; |0*, y) to the distribution p(x|0*, y) = p{9*)"f{x, y\6*)/Z{9*)p{9* , y) 
and an estimate (j){9* ,y) of its normalising constant, (j){9* ,y) :~ Z{9*)p{9* ,y). 

3. Sample a single point x* from p{.\9* , y). 

4. Let (^(i),6'*(i),a;(i),0(z),u(i)) == {9*,9{i - l),x* ,(t>{9* ,y),u) with probability: 

^{9*,y)q{9\9*h{u\9) 



lA 



<l){i-l)q{9*\9)j{u\9* 



otherwise set {9{i),9*{i),x{i),(l){i),u{i)) = {9{i- l),9*{i - l),x{i - l),(l){i - l),u{i- 1)). 

In advance of the proof we also make some definitions relating to the use of the SMC sampler (using 9 

rather than 9* throughout to simplify the notation). We define 7rf(xfe) = "i1{xk)/(t>k{9) (for k = 1,...,T) 

to be the sequence of targets used in the SMC sampler. In our case, we take '-f!^{xT) — p(9)j{xT,y\9) and 

4>t{9) — (t){9,y), so that TT^{xk) = p{x\9,y). The underlying construction of the SMC sampler is such that 

it targets the artificially constructed sequence of distributions 7r^(xi:fc) — J^{xi:k)/4'k{9) (for /c = 1, ...,T), 

with 

fc-i 

7fc(a;i:fe) = itixk) Yl L'^ji^j+i^^j)- 

The SMC sampler generates a weighted importance sample from each extended target TT^{xi.k) in succession. 
We denote the state of the p'th particle at the fc'th target by x^.f. (the joint state of all particles at the fcth 
target is denoted by xi-k). At the fc'th target, the particles are: resampled; moved using a transition kernel; 
then weighted. 

To describe the resampling step, we introduce the distribution r(afe_i|wj._i), with Wfc-i denoting the 
weights of the particles at target fc — 1 and ak-i = (afe_i, •••, o-k-i) with a^_i giving the index of the "parent" 
particle of "child" particle x^.f,. This operation can be interpreted as the process by which child particles at 
target k choose their parent particles from the population at target fc — 1. For the proof we also need to 
define, for fc = 1, ...,T and p — 1, ...,P, the index b^ which the ancestor particle of x^.rp at target fc had at 
that time. 

Let Mi{xi) be the initial proposal density and Mk{xk-i,Xk) for fc = 2,...,T be the transition kernels 
used at each target. The weighting step, applied to the p'th particle at the fc'th target after the resampling 
and move steps, finds the unnormalised weight w^ of the particle at the current target: 

li{xk)Ll_^(.Xk,x"t^') 



iLA^k-\')Mkix,'r:,x,) 

19 



The weight is then normahsed: 



An approximation to the target p{x\0,y) is then given by 

p 
p{dx\9,y) =^w^S^p,{dx) 
p=i 

where 6 is the Dirac delta function, with an estimate of its normahsing constant given by 



ly) = ll 



fc=i 



1 p 

p=i 



In advance of the theorem, we introduce the joint density of the variables generated by the SMC algorithm 
that uses P particles and T targets, defined on the space X'^p x {!,..., P} '■ 



T 



V'^(a:i, ...,XT,ai, ...,aT-i) := ip^{xi) J^ ip^ {ak-i\xk-i)'ip^ {xk\xk-i, ak-i) 



fc=2 
Vp=l / fc=2 \ J)=l 



( "fc-l T.P 



Theorem 1. Let O^ :— '^j^iM'^k = P) ^^ ^^^ number of offspring of particle p at target k. If for any 
p — 1, ...,P and k = 1, ..., T the resampling scheme satisfies 

(i.e., it is unbiased) and also that 

r{al == m\wk) = tw", (16) 

then the EMPMCMC algorithm is an MCMC algorithm targeting a joint distribution that admits p(6,x\y) 
as a marginal. 

Proof. The structure of the proof is as follows. We first fully describe the target and proposal densities used 
in the marginal PMCMC algorithm, then combine these with the density of the u variable generated for the 
exchange algorithm, and show that the EMPMCMC algorithm performs a deterministic "swap" move on this 
extended target. 

To begin, on the space £ := x X'^p x {1,...,P} , we define the proposal and target for 

a marginal PMCMC algorithm. To simply the notation, let E = (6,v,xi, ...,XT,ai, ...,aT^i) and E* = 
{9* , V* ,x\, ...,x'^^a\, ..., a^j^). The proposal is then 

q^{e,E*) := q{0*\9)w^'''tfj'^''{xl,...,x':j.,al,...,aT^i) 

where the weight w^ is present due to the sampling of the index v* to generate x* from the particle 
approximation p{dx\9,y). The target density is given by 
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;^N 



n-iE) 



pT 



^^(xi,...,XT,Qi,---,aT-i) 



which has the desired p{0,x\y) as a marginal. 

Now to derive the acceptance probability of the EMPMCMC algorithm, we express the algorithm in 
terms of a deterministic swap move on the following extended target: 



7r(£;, E*,u) := r^ {E)q'' {9, E*)j{u\0*)/Z{e*). 



Specifically, at each iteration of the algorithm, we apply the transformation T: £x£xX^i'£x£xX 
defined by 

T{E,E*,u) = {E*,E,u). 



The acceptance probability of this move is given by 



lA 



7riTiE,E*,u)) 



= lA 



n^{E*) q^{e*,E)-f{u\e)z{e*) 



tt{E*,E,u) ^"q^{9,E*) ^N{E) -i{u\e*)Z{e)' 

The term n^ (E) / q^ {9* , E) simplifies as follows: 



n^{E) 



1 

pT 



pi0,^^T\y)nL2LUi^f,xf_,) 



I'^io^^E) p%(,|,*)^^Mf (x^/) nL^K^Lii-.-OK (4hs4') 



pio,x^T\y)UL2LLA^f,x''l,) 



1 

pT 



q{9\9*)M (x'^) Yll^.Ml [xtl.J^) III 



P(^,^^l2/) 01=2 iLi(4"> 4-1 



pT llfc = l Z^p=l 



P ~p 



1, y)q{e\9*)M (x?^) nL2 K (4h^ . 4^) nLi 

p{e)i{xV,,y\9)Y{l^^Ll_,(xf,x'l,) 



^K 



(17) 



0(0, y) 



q{e\e*)M(xf) i^^Yil^^Ml(x'tl.x^ 






my) 



q{9\9*)p{y)Z{9)' 



(18) 



using equations (15) and (|16 



Now, substituting equation (18 1 into equation (17 1, we obtain 



Tf(T(E,E*,u)) 

lA \^ ' '—^ = lA 

7r(£'*,^,u 



lA 



^{9*,y) qi9\9*)p{y)Z{9)j{u\9)Z{0*) 
q{9*\9)p{y)Z{9*) 0(0,y) "f{u\9*)Z{9) 

^{9*,y)q{9\9*) j{u\9) 
My) 9(^10)7(^10*)' 



as required. 



a 
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Appendix B: Convergence of Approximate Algorithms 

In the appendix we prove the convergence of MCMC algorithms that take the following form. We note that 
the assumptions used for the proof are relatively strong, and are not widely applicable. However, it likely 
that similar (weaker) results exist under weaker assumptions: the results in this paper are intended as the 
first steps towards future work that would obtain results that hold more generally. 
Suppose that we have an "exact" MCMC algorithm, using transition kernel 

Ki{9,u),{9*,u*)) ^ q{0*\9)7rg,{u*)a{{0,u)AO*,u*)) + Se,ui^*,u*)r{d,u), 

where a{{9,u), (6*,u*)) is such that K is an MCMC kernel that has an invariant distribution of Tr{9,u) = 
■K{9)'Kg{u) and 

r{9,u) = l- f q{9*\9)Trg,{u*)a{{9,u),{9*,u*))d9*du*. 

The SAV method takes this precise form. The theorem below characterises the invariant distribution (where 
it exists), and the convergence rate, of the "approximate" MCMC algorithm given by the kernel 

KM{i9, u), {e*,u*)) = q{e*\9)L^,{vo,u*)ai{9, u), {e*,u*)) + 6e^^{e* ,u*)7{0, u), 

where Lgl{vQ, u*) represents M iterations of an MCMC kernel with invariant distribution tt^. (u*), beginning 
at an arbitrary fixed initial value vq G X and 

r{9,u) = l- f q{9*\9)Lf,{vo,u*)a{{9,u),{9\u*))d9*du*. 

J0',u' 

The same argument can be used the prove equivalent properties of the approximate exchange and ABC- 
MCMC algorithms described in the main text. In the exact versions of these algorithms the target distribu- 
tions and transition kernels have slightly different forms to that of the SAV method: 

• the exchange algorithm has the target distribution given in the main text, and the proposal additionally 
contains the deterministic "swap" move described in the main text; 

• the ABC-MCMC algorithm can be seen to target TT(9)TTg{u)TT^(u\y) (changing the notation to be con- 
sistent with that used in the proof), with the proposal taking the form q(9*\9)TTg<.{u*). 

These differences also result in a different acceptance probability to that used in the SAV algorithm, but 
have no impact on the structure of the proof of the theorem. 

Throughout the theorem and proof, ||.|| represents the total variation norm. 

Theorem 2. Suppose the Metropolis-Hastings kernel K is uniformly ergodic, i.e. there exists Ck G (0, oo) 
and pk G (0, 1), such that for any {9q, uq) Cz O x X and n G N 

\\K"{{9o,uo),.)-7t{.)\\<CkpI; 

Lg* is geometrically ergodic for all 9* G O, i.e. for all 9* G O, there exists Cl{vo,9*) G (0,oo) and 
Pl G (0, 1), such that for ng* -a.e. Wg G <-f and n G N 

\m,ivo,.)^7rg,{.)\\<CLivo)pl; 
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uniformly in 9* . Additionally, suppose that for some D > 0, supg g, q{9'\9) < D, where q is the proposal used 
in the kernel K , and for any M > 1 there exists a distribution ttm on O x X such that ttmKm — 7?m- 

Then for any e G (0, p^ — 1) there exists Mq G N, pk G {p, p(l + e)] C (p, 1) and C € (0, oo) such that 
for all M > Mo, {9q, uq) G 6 x A" and n>l. 



KU{9o,uo),.)-nM{.) 



Ik(.)-^M(.)II <Ck 



<crK, 



^- Pk 



(19) 
(20) 



Proof. To begin, for any 9,u E Q x X and n G N, 

n-l 

KIj{{9,u),.)-K-{{9,u),.) < Y.\\^M'-\iO,u),{K-KM){K^-n){.)) 



< Ck sup 



n-l 



K{{9,u),.)-KM{{e,u),.)\\Y^p'K 



< sup 

^ — Pk 9,u 



K{{9,u),.)-Km{{9,u),.)\ 



(21) 



We now bound the final term on the right hand side. We have: 



sup 



= sup sup 

e,u fs.t.\f\<i 

= sup sup 
e,« /s.t.|/|<i 



K{{9,u),.)-Km{{9,u),.) = sup sup I K{{9,u),{d9',du'))f{9',u')- I KM{{0,u),{d9',du'))f{9',u') 

[q{d9'\9)Tie'{du')a{{9, u), {d9\ du')) + 5e,u{d0\ du')r{9, u)] f{9' , u') 
- I [q{d9'\9)L^ {vo,du')a{{9,u), {d9' ,du')) + Sg^u{d9' ,du'y{9,u)] f{9',u') 

< sup sup f [q{d9'\9)a{{9,u),{d9',du'))f{9',u'){ng,{du')^L^{vo,du')}] 

0,u fs.t.\f\<l J 

+ sup sup \f{9,u){ri9,u)-¥{9,u)}\ 

0.U fs.t.\f\<l 

< sup sup f \q{d9'\9)\\a{{9,u),{d9',du'))\\f{9',u')\\T:g,{du')-L^{vo,du')\ 
e.u fs.t.\f\<iJ 

+ sup sup \f{9,u)\\r{9,u) — r{9,u)\ 
e.u fs.t.\f\<i 

< / D\TTg,{du')- LP{va,du')\ +sup|r(6i,u) -r{9,u)\ 
J e,u 

< DCLivo)p^ + sup\ri9,u)-r{9,u)\ 

e.u 

using the geometric ergodicity of Lg* for every 9* G Q. Using this property again, for the final term we have: 



s.\vp\r{9,u) — r{9,u)\ — sup 



q{9*\9)a{{9,u),{9*,u*)) 



{Lf,{vo,u*)-Tig.{u*))d9*du* 



< ClMp^sup 

0,u 

< cUvo)pf!. 



q{9*\9)a({9,u),{9* ,u*))d9*du* 
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Combining these results we obtain 



sup 

(e,«)eexA' 



K{{e, u), .) - Km{{0, u), .) <{D + 1)ClMp 



„M 



Using this result, and the uniform ergodicity of K, we obtain that for any (0, u), (i?, w) G 6 x A" 



Klj{{e,u),.)-Klj{{d,v),.) 



< 



+ \\KUie,u),.)-KU{^,v),.)\\ 



< 2- 



Ck 



^- Pk 



{D + l)CLiuo)p'^+CKPl. 



Now define 



PK ■■= PK xICk ( 1 + ^Pj^P^iD + l)Ci(«o)) < PK VC^ (l + 2^|^(i? + 1)Cl{vo. 



Choose e G (0, pK — 1) and let n G N be such that ^Ck < \/l + £ and Mi G N such that 

Ml -n 

1 + 2^f^^p + l)Ci(^>o) < VTT^. 



We then have that pK < Px(l + e) < 1 for A/ > Afi, and equation 19 follows. 
Then from equation [21] we have that for any n >\ 



a— 1 



i=0 



n-1 



< CK{D + l)CLiuo)pi'Y.PK 

Ck 



< 



i=0 



^- Pk 
So for any e > 0, we can chose M2 G N such that {D + 1)Cl(uo)Pl ^ < £■ so that 



7rKlj{{0,u),.)-nK^\i9,u),.) 



< 



Using llvr- ttmII = lini„- 



ttKI, ~ ^mKI 



and choosing A/o = A#i V M2 the proof is completed. 



a 



We note that the same argument may be used to obtain the same result where Lg* is allowed to use the 
value of u generated at the previous iteration, as long as Lg* is uniformly ergodic for every 6* G 8. 
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